{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Missing Values\n",
    "\n",
    "When your data consists of floating-point numbers (e.g. 2.0), it is often easiest to let `NaN` denote missing values. In other cases (and more generally), use `missing`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "printyellow (generic function with 1 method)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using Dates\n",
    "\n",
    "include(\"printmat.jl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NaN\n",
    "\n",
    "`NaN` (Not-a-Number) can be used to indicate that a floating-point \"number\" is missing or otherwise strange. For other types of data than floats, you may want to use `missing` (see below) instead.\n",
    "\n",
    "Most computations involving NaNs give NaN as the result.\n",
    "\n",
    "NaNs are often used to represent missing data, but this works only with floating-point numbers (e.g. 2.0), not with integers (e.g. 2)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NaN\n"
     ]
    }
   ],
   "source": [
    "println(2.0 + NaN)"
   ]
  },
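  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration, the cell below sketches how NaNs behave in comparisons. Since `NaN == NaN` is false, use `isnan()` to test for NaNs. The array in the last line is made up for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "println(NaN == NaN)           #false: NaN is not equal to anything, not even itself\n",
    "println(isnan(NaN))           #true: isnan() is the way to test for NaN\n",
    "println(isequal(NaN,NaN))     #true: isequal() treats NaNs as equal\n",
    "println(eltype([1,2,NaN]))    #Float64: the integers are promoted to floats"
   ]
  },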
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Data\n",
    "\n",
    "When your data (loaded from a csv file, say) has special values for missing data points (for instance, -999.99), then you can simply replace those values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z: \n",
      "     1.000       NaN\n",
      "     2.000    12.000\n",
      "     3.000    13.000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = [1.0 -999.99;\n",
    "        2.0 12.0;\n",
    "        3.0 13.0]\n",
    "\n",
    "z = replace(data,-999.99=>NaN)    #replace -999.99 by NaN\n",
    "println(\"z: \")\n",
    "printmat(z)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NaNs in a Matrix\n",
    "\n",
    "If a matrix contains NaNs, then many calculations (e.g. summing all elements) give NaN as the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z has some NaNs\n",
      "\n",
      "The sum of each column: \n",
      "     6.000       NaN\n",
      "\n"
     ]
    }
   ],
   "source": [
    "if any(isnan.(z))                      #check if any NaNs\n",
    "  println(\"z has some NaNs\")\n",
    "end\n",
    "\n",
    "println(\"\\nThe sum of each column: \")\n",
    "printmat(sum(z,dims=1))"
   ]
  },
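  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you instead want a result that skips the NaNs, one possible approach (a sketch, not the only way) is to filter them out before the calculation. The matrix `x` and the name `colSums` are made up for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = [1.0 NaN;\n",
    "     2.0 12.0;\n",
    "     3.0 13.0]             #made-up example data (same values as z above)\n",
    "\n",
    "#sum the non-NaN elements, column by column\n",
    "colSums = [sum(filter(!isnan,x[:,j])) for j in 1:size(x,2)]\n",
    "println(\"The sum of each column, skipping NaNs: \")\n",
    "println(colSums)"
   ]
  },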
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting Rid of NaNs\n",
    "\n",
    "It is common in statistics to throw out all cases with NaNs. For instance, if `z[t,:]` is the data for period $t$ and it contains one or more `NaN` values, then the entire row is typically dropped.\n",
    "\n",
    "This is a reasonable approach if it can be argued that whether a data point is missing is random and unrelated to the subject of the investigation. It is much less reasonable if, for instance, the returns of all poorly performing mutual funds are listed as \"missing\" and you want to study which fund characteristics drive performance.\n",
    "\n",
    "The code below shows a simple way of doing this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z2: a new matrix where all rows with any NaNs have been pruned:\n",
      "     2.000    12.000\n",
      "     3.000    13.000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "vb = any(isnan.(z),dims=2)    #indicates rows with NaNs\n",
    "\n",
    "z2 = z[.!vec(vb),:]           #keep only rows without NaNs\n",
    "println(\"z2: a new matrix where all rows with any NaNs have been pruned:\")\n",
    "printmat(z2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Missings\n",
    "\n",
    "A `missing` can be used to indicate a missing value for most types (not just floats).\n",
    "\n",
    "Similarly to NaNs, computations involving `missing` (for instance, `1+missing`) result in `missing`.\n",
    "\n",
    "In contrast to NaNs, working with `missing` sometimes involves converting a traditional array to an array that can include `missing` (or the other way around). The [Missings](https://github.com/JuliaData/Missings.jl) package has helper routines for this."
   ]
  },
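  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below is a short sketch of how `missing` propagates. It uses only functions from Base Julia, and the array in the last line is made up for this illustration. Notice that `missing == missing` gives `missing` (not `true`), so use `ismissing()` for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "println(1 + missing)                 #missing\n",
    "println(missing == missing)          #missing, not true\n",
    "println(ismissing(missing))          #true: ismissing() is the way to test\n",
    "println(isequal(missing,missing))    #true: isequal() treats missings as equal\n",
    "println(eltype([1,missing,3]))       #a Union of Missing and the integer type"
   ]
  },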
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Missings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z: \n",
      "         1   missing\n",
      "         2        12\n",
      "         3        13\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = [1 -999;\n",
    "        2 12;\n",
    "        3 13]\n",
    "z = allowmissing(data)                #convert to an array that can include missing\n",
    "z = replace(z,-999=>missing)          #replace -999 by missing\n",
    "println(\"z: \")\n",
    "printmat(z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z has some missings\n"
     ]
    }
   ],
   "source": [
    "if any(ismissing.(z))                      #check if any missings\n",
    "  println(\"z has some missings\")\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z2: a new matrix where all rows with any missings have been pruned:\n",
      "         2        12\n",
      "         3        13\n",
      "\n"
     ]
    }
   ],
   "source": [
    "vc = any(ismissing.(z),dims=2)     #indicates rows with missings\n",
    "\n",
    "z2 = z[.!vec(vc),:]                #keep only rows without missings\n",
    "println(\"z2: a new matrix where all rows with any missings have been pruned:\")\n",
    "printmat(z2)"
   ]
  },
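  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pruning rows is not the only option. As a sketch of two alternatives from Base Julia, `skipmissing()` lets a calculation skip the missing elements, and `coalesce.()` replaces each `missing` by a value of your choice. The matrix `y` and the fill value 0 are made up for this illustration. Whether 0 is a sensible fill value depends on the application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = [1 missing;\n",
    "     2 12;\n",
    "     3 13]                      #made-up example data (same values as z above)\n",
    "\n",
    "println(sum(skipmissing(y)))    #sum of all non-missing elements\n",
    "\n",
    "y0 = coalesce.(y,0)             #replace each missing by 0\n",
    "println(sum(y0,dims=1))         #column sums of the filled matrix"
   ]
  },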
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `z2` no longer contains any `missing` (although its element type still allows them), you can typically use it like any other array. However, if you (for some reason) need to work with a traditional array, then convert `z2` (see below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The type of z2 is Array{Union{Missing, Int64},2}\n",
      "\n",
      "The type of z3 is Array{Int64,2}\n"
     ]
    }
   ],
   "source": [
    "println(\"The type of z2 is \", typeof(z2))\n",
    "\n",
    "z3 = disallowmissing(z2)       #convert to traditional array,\n",
    "                               #same as convert.(Int,z2)\n",
    "\n",
    "println(\"\\nThe type of z3 is \", typeof(z3))"
   ]
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 1.3.0",
   "language": "julia",
   "name": "julia-1.3"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.3.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
